SUMMARY

Recent developments in reliability modeling for fault tolerant avionic computing systems are presented.
Emphasis is placed on the modeling of large systems where issues of state size and complexity,
fault coverage, and practical computation are addressed.

A two-fold (analytical modeling) developmental effort is described based on the "structural modeling"
and "fault coverage modeling" approaches. With regard to the structural modeling effort, two techniques
under study are examined. One technique, which was successfully applied to an 865-state pure death
stationary Markov model, is presented. The modeling technique is applied to a fault tolerant multiprocessor
currently under development. Of particular interest is a short computer program which executes very quickly
to produce reliability results for a large-state space model. Also, this model incorporates fault coverage
states for processor, memory, and bus LRUs (Line Replaceable Units).

A second structural reliability modeling scheme, which is aimed at solving nonstationary Markov models,
is discussed. This technique, which is under development, will provide the tool required for studying the
reliability of systems with nonconstant failure rates, including intermittent/transient faults, electronic
hardware which exhibits decreasing failure rates, and hydromechanical devices which typically have
wearout failure mechanisms.

A general discussion of fault coverage and how it impacts system design is presented together with a
historical account of the research which led to the current fault coverage developmental program. Several
aspects of fault coverage, including modeling and data measurement of intermittent/transient faults and
latent faults, are elucidated and illustrated. The CARE II (Computer-Aided Reliability Estimation) coverage
model is presented and shortcomings to be eliminated in the future CARE III are discussed.

The emergence of the so-called latent fault as a significant factor in reliability assessment is 
gaining increased attention from a modeling viewpoint; therefore, nuances of latent faults, models for such, 
and a method for latent fault measurement are depicted. 


1. INTRODUCTION

The importance of achieving a faithful reliability assessment capability for avionic fault tolerant
systems cannot be overstressed. Reliability issues involve virtually every aspect of design, packaging,
and field operations, with regard to safety, maintainability, and invariably profits. Successful
implementation of digital fault tolerant computers for critical flight functions in commercial aircraft cannot
be realized without rigorous and credible analytical and simulative demonstrations of system reliability
and fault tolerance. This conviction is fostered by the observation, supported by analysis, that life
testing to demonstrate the ultrareliability of these systems will be impractical and that, because of the
safety aspect, the full potential of such systems will not be realized until system reliability and fault
tolerance are substantiated.

The task of producing a credible reliability assessment capability is indeed a formidable one. The 
root of the problem is embodied in the very essence that makes the digital computer such an attractive 
device for use in a host of applications, namely its adaptability to changing requirements, computational 
power, and ability to test itself. 

Among the many factors to be considered in the design of fault tolerant systems are those which can 
have a direct impact on reliability. These factors must be accurately accounted for in a faithful relia- 
bility assessment. Figure 1 depicts some of the more important elements delineated into four categories: 
(1) Type and Manifestation, (2) Cause, (3) System Effect, and (4) Defense. Every digital avionic fault 
tolerant system must be designed to effectively cope with a myriad of hardware and software anomalies 
which are classified in categories 1 and 2. Categories 3 and 4 typify the effect of anomalies and some 
techniques for coping with them. Figure 2 portrays the combinations of categories 1 and 2. For example, 
a hardware anomaly could be a permanent random failure. On considering the number of devices in a digital
system that are susceptible to failure in the ways depicted in figure 2 and combining software anomalies in
a similar manner, one quickly begins to appreciate the designer's and the reliability analyst's tasks in
accounting for these factors in reliability assessments. A rigorous discussion regarding some of these
factors is given in McCluskey and Losq, 1978.

From a reliability assessment viewpoint, it was not until recently that analysts began to account for
these factors (Roth et al., 1967) with the probabilistic concept of fault coverage. Since then, numerous
reports have appeared on the effects of fault coverage accountability (Ultra-Systems, Inc., 1974; Bavuso,
1975; and Bjurman et al., 1976).


2. RELIABILITY MODELING APPROACH

Reliability modeling research at the NASA Langley Research Center has been strongly influenced by our
fault tolerant computer architectural research program, which commenced circa 1971 with the initiation of a
study on the design of a Fault Tolerant Airborne Digital Computer (Wensley et al., 1973, and Ratner et al.,
1973). This study identified two potentially viable computer architectures for aircraft flight control
applications. They are the SIFT (Software Implemented Fault Tolerance) and the FTMP (Fault Tolerant
Multiprocessor) (Wensley et al., 1978, and Hopkins and Smith, 1975 and 1978). Both architectural concepts
utilize multiple LSI (Large Scale Integration) processor and memory devices, resulting in a large number
of SRUs (Smallest Reconfigurable Units). From a reliability modeling point of view, this scheme contributes
heavily to the modeling complexity by increasing the number of possible operational hardware states.
This state of affairs has focused our research in the direction of developing modeling techniques that are
applicable to large-state models. For convenience, this modeling thrust will be referred to as the
structural analytic approach. A parallel effort to the structural analytic approach was initiated by a
study in 1973 which produced the Computer-Aided Reliability Estimation (CARE II) computer program. To
date, the CARE II fault coverage model represents the most advanced generalized model published in the
open literature. It was this study which launched the Langley fault coverage modeling approach.

Because it is anticipated that viable fly-by-wire digital fault tolerant systems for aircraft flight
control will be required to meet unreliability requirements of less than or equal to 10^-9 per flight and,
to be practical, less than or equal to 10^-9 at 10 hours, reliability models must be implemented in analytic
form in lieu of simulation models; however, the use of very high speed emulators and/or parallel computers
may at some future time diminish the analytic approach's dominance. This is not to say that simulative
techniques are not presently applicable in reliability modeling. On the contrary, simulation plays a major
role in determining vital reliability parameters associated with fault coverage modeling.


3. STATE-OF-THE-ART MODELING PROBLEMS 

The state-of-the-art of structural analytic modeling of large systems is typified by the reliability
analysis method employed in the ARCS (Airborne Advanced Reconfigurable Computer System) study (Bjurman
et al., 1976). The solution technique is matrix oriented and is based on constructing a similarity relation
such that the transition matrix is similar to a diagonal matrix containing the eigenvalues along the
diagonal. For pure-death Markov processes with distinct eigenvalues, this solution method is extremely
fast in a general purpose digital computer and, thus, very attractive for use in large-state space models.
With some minimal care in assigning failure rate data so that, for all practical purposes, the system
eigenvalues are mathematically distinct, this solution scheme is applicable to a large class of computer
architectures of practical interest. Such a system is the FTMP, which was analyzed at Langley using the
described method. An abbreviated state transition diagram for the FTMP appears in figure 3, where a system
state is defined as the 6-tuple vector (a, b, c, d, e, f), where

a = number of working processor modules
b = number of processor modules in a recovery state
c = number of working memory modules
d = number of memory modules in a recovery state
e = number of working bus modules
f = number of bus modules in recovery

and the SRUs are the processor, memory, and bus modules. Initially the system is in state
(10, 0, 10, 0, 5, 0) and the final state is (5, 0, 2, 0, 2, 0). Further loss of hardware is considered
system failure since crucial flight functions cannot be effected. Elements b, d, and f describe states
involving recovery. In addition to system loss resulting from hardware depletion, system failure occurs
(in this model) when a second fault occurs within a recovery interval. This condition was imposed because
the FTMP's primary fault detection and isolation mechanisms are based on a functional level software
majority voting scheme. In actuality, the FTMP can recover from many double failures; however, the double
failure constraint was necessary to reduce the state size of the reliability model; fortunately, it also
produces a conservative reliability estimate. Several other necessary conservative assumptions were
required to bring the state size down to a manageable level; in this case, an 865-state model resulted.
Although 865 states for a reliability model is considered very large by industry standards, this analysis
presented no problem for our Control Data Corporation CYBER 175 computer. In fact, a mission time of
10 hours required only 74 CPU seconds.
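The similarity-transformation solution described above can be sketched for a very small pure-death model. This is an illustration only, not the Langley program: the function name, the three-state triplex example, and the failure rate are assumptions introduced here. It relies on the fact that, with states ordered so that faults only move the system to higher-numbered states, the transition matrix is lower triangular and its eigenvalues are simply the diagonal entries, which must be distinct.

```python
import math

def pure_death_probabilities(A, p0, t):
    """State probabilities at time t for dP/dt = A P with A lower triangular.

    The eigenvalues are the diagonal entries of A and must be distinct.
    """
    n = len(A)
    d = [A[k][k] for k in range(n)]              # eigenvalues
    # V[i][k] is component i of eigenvector k; V is unit lower triangular.
    V = [[0.0] * n for _ in range(n)]
    for k in range(n):
        V[k][k] = 1.0
        for i in range(k + 1, n):
            s = sum(A[i][m] * V[m][k] for m in range(k, i))
            V[i][k] = s / (d[k] - d[i])          # requires d[k] != d[i]
    # Solve V c = p0 by forward substitution, i.e. c = V^-1 P(0).
    c = [0.0] * n
    for i in range(n):
        c[i] = p0[i] - sum(V[i][m] * c[m] for m in range(i))
    # P(t) = V exp(D t) V^-1 P(0)
    return [sum(V[i][k] * c[k] * math.exp(d[k] * t) for k in range(i + 1))
            for i in range(n)]

# Hypothetical triplex: 3 good -> 2 good -> failed, unit failure rate lam per hour.
lam = 1.0e-4
A = [[-3 * lam, 0.0, 0.0],
     [3 * lam, -2 * lam, 0.0],
     [0.0, 2 * lam, 0.0]]
p = pure_death_probabilities(A, [1.0, 0.0, 0.0], 10.0)
print("P(system failed at 10 h) =", p[2])
```

Because the eigenvectors come from simple substitution and the time dependence is a sum of exponentials, the cost per mission time is small, which is in keeping with the short run time reported for the 865-state model.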

Aside from the surprisingly low CPU time of such a complex analysis, another unexpected outcome
resulted and is shown in figure 4. The probability of system failure in 10 hours is plotted against
processor failure rate per hour for the 865-state model with 10 processors, 10 memories, and 5 buses; and
for a 673-state model with 10 processors, 8 memories, and 5 buses. The data show that the addition of
2 memory modules increased the system probability of failure. This trend also applies if in lieu of
"processor" appearing in figure 4, "memory" or "bus" is plotted. One explanation for this unexpected data
is the sensitivity of the reliability model to the occurrence of a second fault during recovery. Beyond a
particular hardware complement, increasing hardware redundancy diminishes system reliability because of
the increased likelihood of additional faults. If the constraint that a second fault occurring in a
recovery interval fails the system were relaxed, the results would change in favor of increasing redundancy.
The penalty for increased realism is a considerable increase in the model state size. To date, a practical
upper bound on the state size for the matrix solution technique previously discussed has not been explored.
On the pessimistic side, it is sobering to realize that the 865-state model was reduced from approximately
10 million states through the imposition of certain conservative constraints on the model.
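The trade-off between hardware depletion and a second fault during recovery can be illustrated with a toy calculation. It is a rough first-order sketch and not the 865-state model: the function names, the failure rate, the one-second recovery window, and the requirement of two working units are all hypothetical values chosen for illustration.

```python
import math

LAM = 1.0e-4          # hypothetical unit failure rate per hour
TAU = 1.0 / 3600.0    # hypothetical recovery window: one second, in hours
T_MISSION = 10.0      # mission time in hours

def p_exhaustion(n, m, lam, t):
    """Probability that fewer than m of n independent units survive to time t."""
    q = 1.0 - math.exp(-lam * t)              # single-unit failure probability
    return sum(math.comb(n, k) * q**k * (1.0 - q)**(n - k)
               for k in range(n - m + 1, n + 1))

def p_second_fault_in_recovery(n, lam, tau, t):
    """First-order estimate: a first fault (rate n*lam) followed within tau
    by a fault in any of the remaining n - 1 units."""
    return (n * lam * t) * ((n - 1) * lam * tau)

def p_system_failure(n, m, lam, tau, t):
    return p_exhaustion(n, m, lam, t) + p_second_fault_in_recovery(n, lam, tau, t)

best_n = min(range(2, 11), key=lambda n: p_system_failure(n, 2, LAM, TAU, T_MISSION))
for n in range(2, 11):
    print(n, p_system_failure(n, 2, LAM, TAU, T_MISSION))
```

With these assumed numbers the depletion term falls rapidly as units are added while the near-coincident fault term grows roughly as n(n - 1), so the total first falls and then rises, which is the qualitative behavior described for figure 4.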

The state-of-the-art of reliability modeling of large systems has progressed one step beyond that
already described to include transient faults. This amounts to adding the transient failure rate
(transition rate) to hardware failure rates to account for persistent transient faults that behave like
permanent faults (Bjurman et al., 1976, and Ng, 1976). The reliability contribution due to the time the
machine spends in the recovery state because of a transient is not accurately modeled. As most analyses
assume constant transient transition rates, one can ignore the recovery state and combine the transient
transition rate with the permanent fault transition rate.

This scenario of the state-of-the-art of reliability modeling for fault tolerant systems surely must
convey the notion of modeling inaccuracies, not to mention the conspicuous absence of any discussion of
software anomalies and other anomalies portrayed in figure 2. Even though the reliability analyst makes
every attempt to be conservative when he cannot be accurate, more often than not he is forced into a
compromising position that raises doubt and diminishes confidence in the analysis.


4. A NOVEL APPROACH FOR RELIABILITY ASSESSMENT

Current modeling research is being driven by the need for analytic techniques capable of modeling fault
tolerant systems with state sizes on the order of 1000, to include sensors, actuators and their
computer interfaces. There is mounting evidence that certain electronic devices exhibit nonconstant
hazard rates ([first citation illegible in source], and Shooman, 1974); mechanical and hydraulic devices
commonly exhibit wearout, i.e., increasing hazard rates with time. These observations, coupled with the need
to account for fault latency, intermittent/transient faults, and software failures, present a
strong case for an analytic technique capable of modeling nonconstant hazard rates.


The development of such a technique is currently under study and will result in the development of a
General Computer-Aided Reliability Estimation (CARE III) computer program. The desire to reduce the large
state sizes for Markov processes vis-a-vis CARSRA (Computer-Aided Redundant System Reliability Analysis,
Bjurman et al., 1976) and the need to treat nonconstant hazard rates directed the study toward a generalized
Markov process concept, namely the processes in which the Chapman-Kolmogorov equation holds:


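$$P_{\ell i}(t,\tau) = \sum_{v} P_{v i}(s,\tau)\, P_{\ell v}(t,s)$$

where $P_{\ell i}(t,\tau)$ is the probability of being in state $\ell$ at time $t$ given that the system was in
state $i$ at time $\tau$ (citation illegible in source). It follows that

$$\frac{\partial}{\partial t} P_{\ell i}(t,\tau) = -P_{\ell i}(t,\tau)\,\lambda_{\ell}(t,\tau) + \sum_{j \ne \ell} P_{j i}(t,\tau)\, c_{j\ell}(t,\tau)\, \lambda_{j\ell}(t,\tau)$$

If the notation indicating the condition that the system be in state $i$ at time $\tau$ be suppressed, the
following recursive equation results:

$$\frac{d}{dt} P_{\ell}(t) = -\lambda_{\ell}(t)\, P_{\ell}(t) + \sum_{j \ne \ell} P_{j}(t)\, c_{j\ell}(t)\, \lambda_{j\ell}(t)$$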
where

$P_{\ell}(t)$ = probability of being in state $\ell$ at time $t$

$\lambda_{j\ell}(t)$ = transfer rate from state $j$ to state $\ell$

$\lambda_{\ell}(t) = \sum_{j \ne \ell} \lambda_{\ell j}(t)$

$c_{j\ell}(t)$ = coverage associated with a failure which, if coverage were perfect, would cause a transfer
from state $j$ to state $\ell$


The system reliability is given by

$$R(t) = \sum_{\ell \in L} P_{\ell}(t)$$

for the set $L$ of allowable states.

From a computational point of view, a more accurate form is obtained by letting

$$Q_{\ell}(t) = P^{*}_{\ell}(t) - P_{\ell}(t)$$

where $P^{*}_{\ell}(t) = P_{\ell}(t)$ given perfect coverage. The system unreliability $Q(t)$ is given by

$$Q(t) = 1 - R(t) = \sum_{\ell \in L} Q_{\ell}(t) + \sum_{\ell \in \bar{L}} P^{*}_{\ell}(t)$$

with $L \cup \bar{L}$ being the set of all possible states. And $\bar{c}_{j\ell}(t) = 1 - c_{j\ell}(t)$ so that

$$Q_{\ell}(t) = e^{-\int_0^t \lambda_{\ell}(\eta)\,d\eta} \int_0^t \Big[ \sum_{j \ne \ell} \lambda_{j\ell}(\tau) \big( Q_{j}(\tau) + P_{j}(\tau)\,\bar{c}_{j\ell}(\tau) \big) \Big]\, e^{\int_0^{\tau} \lambda_{\ell}(\eta)\,d\eta}\, d\tau$$


The virtues of this scheme are that the hazard rate $\lambda_{j\ell}(t)$ and coverage $c_{j\ell}(t)$ are time
dependent; also the contribution to system unreliability due to perfect and imperfect coverage is decoupled.
The need for the $\lambda_{j\ell}(t)$ was previously discussed, but the importance of $c_{j\ell}(t)$ was not
presented.
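A minimal numerical sketch of the Kolmogorov technique follows. It integrates the state equations for a hypothetical three-state duplex model with a fourth-order Runge-Kutta scheme, once with the assumed coverage and once with perfect coverage, and forms the unreliability from the difference. The model, the coverage value of 0.99, the hazard functions, and all function names are assumptions made for illustration; the CARE III solution method itself is not described in enough detail here to reproduce.

```python
import math

def rhs(t, p, lam, cov):
    """dP_l/dt = -lambda_l(t) P_l + sum over j != l of P_j c_jl(t) lambda_jl(t)."""
    n = len(p)
    rate = [[lam(j, l, t) if j != l else 0.0 for l in range(n)] for j in range(n)]
    return [-sum(rate[l]) * p[l]
            + sum(p[j] * cov(j, l, t) * rate[j][l] for j in range(n) if j != l)
            for l in range(n)]

def integrate(p, t_end, steps, lam, cov):
    """Classical fourth-order Runge-Kutta from t = 0 to t_end."""
    h, t = t_end / steps, 0.0
    for _ in range(steps):
        k1 = rhs(t, p, lam, cov)
        k2 = rhs(t + h / 2, [x + h / 2 * k for x, k in zip(p, k1)], lam, cov)
        k3 = rhs(t + h / 2, [x + h / 2 * k for x, k in zip(p, k2)], lam, cov)
        k4 = rhs(t + h, [x + h * k for x, k in zip(p, k3)], lam, cov)
        p = [x + h / 6 * (a + 2 * b + 2 * c + d)
             for x, a, b, c, d in zip(p, k1, k2, k3, k4)]
        t += h
    return p

def duplex(hazard, c):
    """Hypothetical model: state 0 = two units, 1 = one unit, 2 = exhausted."""
    def lam(j, l, t):
        if (j, l) == (0, 1):
            return 2.0 * hazard(t)
        if (j, l) == (1, 2):
            return hazard(t)
        return 0.0
    def cov(j, l, t):
        return c if (j, l) == (0, 1) else 1.0
    return lam, cov

def unreliability(hazard, c, t_end, steps=1000):
    """Q(t) = sum of Q_l over allowable states {0, 1} plus P*_2 (exhausted)."""
    lam, cov = duplex(hazard, c)
    start = [1.0, 0.0, 0.0]
    p = integrate(start, t_end, steps, lam, cov)
    p_star = integrate(start, t_end, steps, lam, lambda j, l, t: 1.0)
    q = [ps - pi for ps, pi in zip(p_star, p)]
    return q[0] + q[1] + p_star[2], p

constant = lambda t: 1.0e-4                        # constant hazard per hour
wearout = lambda t: (2.0 / 1.0e4) * (t / 1.0e4)    # Weibull hazard, shape 2
Q_const, p_const = unreliability(constant, 0.99, 10.0)
Q_wear, _ = unreliability(wearout, 0.99, 10.0)
print(Q_const, Q_wear)
```

Probability lost through imperfect coverage simply leaves the allowable states, so the sum of the state probabilities falls below one and the shortfall shows up in Q(t). Swapping the hazard function is all that is needed to move from constant to time-varying rates.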

In avionic systems which utilize dynamic resource allocation schemes, such as is possible with the
SIFT and FTMP systems, the proportion of hardware and software resources is dependent on the aircraft
flight envelope. Flight critical phases require greater hardware redundancy and fault
monitoring. The latter factor appears in reliability models as time-varying coverage $c_{j\ell}(t)$. A more
subtle need for $c_{j\ell}(t)$ is to account for fault latency. The probability of system failure due to
insufficient coverage is a function of the number of existing failures embedded in the system. That is,
the probability of a second SRU (processor, bus, memory) failure occurring during the recovery
time is a function of the number of SRUs functioning at that time.

Preliminary studies of the Kolmogorov technique are encouraging from the viewpoints of accuracy and
computer run time. Figures 5 and 6 compare FTMP reliability data generated with the Kolmogorov technique
against data generated with other conventional techniques. To make a meaningful comparison, $c_{j\ell}(t)$
and $\lambda_{j\ell}(t)$ were constrained as constants in the Kolmogorov technique. It is suspected that the
discrepancies depicted in figure 6 are attributable to simplifying assumptions required to keep the
conventional analysis technique tractable.

Current work on CARE III is directed toward developing a coverage model compatible with the Kolmogorov
technique and is based to a large extent on the CARE II coverage model (Raytheon Company, 1974 and 1976).
Improvements to be sought are modification for coverage time dependency ($c_{j\ell}(t)$) to model latent faults
and, of greater difficulty, to reduce the burden placed on the user in defining input data for the modified
CARE II coverage model. A third improvement is to include a more sophisticated intermittent/transient
fault coverage model and, if possible, a software failure model.

The CARE II coverage model is a powerful basis upon which to build the Kolmogorov coverage model
(KCM). In its completed form, the KCM will determine coefficients for the Kolmogorov reliability model
(KREL-M). Coverage is conceived as consisting of three fundamental processes: system fault detection,
fault isolation to the SRU, and recovery, which may require hardware replacement and/or software
correction. Failure to properly effect one of those processes constitutes a coverage failure, which is usually
modeled as a system failure. A faithful coverage model must provide the mechanisms by which the
reliability analyst can relate the coverage coefficients to the system factors that affect coverage. These
factors include the fault classes (permanent/intermittent hardware/software faults), the system fault
detection mechanisms (software/hardware voting, software self-monitoring, BITE (Built In Test Equipment),
etc.), SRU fault isolation mechanisms (similar to detection), and recovery procedures (hardware
replacement, instruction retry, etc.). Detectors are modeled as competitors in the detection process. Every
detector has some chance of discovering a fault; however, most detectors usually are specialized for a
particular class of faults. In CARE II, this modeling process is under user control. It is assumed in the
coverage model that the detector which discovers a fault is most capable of defining fault isolation and
recovery strategies. These strategies are user defined.

The CARE II coverage model takes the following form:

$$c_x(i,j) = P_i\, p_{sx}\, P_i' \int_{0}\!\!\int_{0} g_i(\tau)\, h_i(\tau' - t_{sx})\, r_i(\tau,\tau')\, d\tau\, d\tau'$$

where

$c_x(i,j)$ = conditional probability that the system can recover from a fault in stage $x$*, given that the
fault belongs to fault class $j$ and is detected by detector $i$

$\tau$ = detection time

$\tau'$ = isolation time

$p_{sx}$ = defective spare detection

$t_{sx}$ = spare unit test time

$P_i$ = noncompetitive detection probability

$P_i'$ = isolation probability associated with $P_i$

$h_i$ = isolation rate

$r_i$ = recovery probability

$g_i$ = competitive detection rate

*A stage is defined as a set of identical devices.

Of all of these parameters, $g_i(\tau)$ is the most difficult to obtain because it is a function of
detector $i$ and the entire ensemble of detectors and their interrelationships.
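The dependence of the competitive detection rate on the whole ensemble of detectors can be seen in a simple special case. The sketch assumes that each detector, acting alone, would find the fault after an independent exponentially distributed time. That assumption, the detector names, and the rate values are introduced here for illustration and are not part of the CARE II model as described.

```python
import math

def first_detection_probabilities(rates):
    """Probability that each detector is the first to discover the fault,
    assuming independent exponential detection times."""
    total = sum(rates.values())
    return {name: r / total for name, r in rates.items()}

def competitive_detection_density(rates, name, tau):
    """Density of 'detector `name` discovers the fault first, at time tau'.

    It is the detector's own rate times the probability that no detector
    in the ensemble has discovered the fault before tau.
    """
    total = sum(rates.values())
    return rates[name] * math.exp(-total * tau)

# Hypothetical detection rates (per second) for two competing detectors.
rates = {"voting": 4.0, "self_test": 1.0}
probs = first_detection_probabilities(rates)
print(probs)
print(competitive_detection_density(rates, "self_test", 0.5))
```

Even in this simplest case the density for one detector contains the rates of all the others, so adding or retuning any detector changes every competitive detection rate.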

5. ACQUISITION OF COVERAGE DATA 

Even after modifying the CARE II coverage model for the KREL-M, some difficulty in using this
capability remains. Eventually the analyst must obtain coverage data peculiar to the system of
interest. Three types of data are currently required: intermittent hazard rate data including duration
data, fault detection statistics for various classes of faults and detectors, and software hazard rate
data. This paper addresses the first two; a discussion on the third is beyond the scope of
this paper and will not be addressed further.

A need for intermittent fault arrival data has been identified and work has recently commenced to generate
a data base of intermittent field hardware failure data in digital electronics*. The long-term aim of this
effort is [remainder of sentence illegible in source].

Beyond the pressing issues surrounding software reliability, validity and/or validation, characterization
of latent faults ranks high in importance to the eventual success of utilizing digital systems for
[illegible in source]. Given the very large number of possible machine states that a digital
computer can assume and the variety of possible failures, it is impossible to exhaustively test such a
device. Therefore the presence of undetected faults is always a possibility, and for systems
designed to obtain system probabilities of failure of less than 10^-9 in 10 hours of flight, even a small
probability of a latent fault occurring can have a large effect on system reliability. It is certainly
with these thoughts in mind that designers incorporate redundancy; however, the cost of constructing
machines which tolerate more than three coexisting manifested faults becomes prohibitive. An acceptable
alternative is to constantly search for faults and eliminate their effects so that the machine is not
presented with coexisting manifested faults, i.e., only one at a time. To insure that this goal is
satisfied, the designer must have a priori knowledge of fault occurrence and manifestation rates so that
adequate fault detection and recovery mechanisms can be incorporated.

There are many fault detection schemes; the most obvious is comparison/voting, which can be implemented
in at least two ways: by executing a special software test and comparing expected results with
computed results (self-monitoring), or two or more uniprocessors can compare functional level outputs during
normal computation where both processors are executing the same code. The time between fault occurrence
and detection is latency time. If this time is short compared with the time between SRU failures, then
the machine will encounter only single failures and have sufficient time to cope with them. Long latency
times [remainder of sentence illegible in source].

In an attempt to determine methods of acquiring latency data, a study entitled "Modeling of a Latent
Fault Detector in a Digital System" was conducted (Nagel, 1978). A very simple computer (VSC) modeled at
the gate level was designed and simulated to execute on a CDC CYBER 175 host computer. Six simple
programs were written using the VSC that consisted primarily of the following instructions:

Fetch and store 

Add and subtract 

Shift right and shift left 

AND and OR 

Indirect addressing 

Overflow indicator 

Branch 

Copy to and from temporary storage 

As the VSC executed each of the six programs, single faults were induced uniformly at random.
Input, output, stuck-at-one, and stuck-at-zero faults were equally likely occurrences.
Initially the number of runs manifesting faulty output was recorded and produced the following results:


PROGRAM                   SAMPLE SIZE   DETECTIONS   ESTIMATED DETECTION   ESTIMATED STANDARD
                                                     PROBABILITY           DEVIATION

Fibonacci (FIB)               211            98            0.464                0.034
Fetch and Store (F&S)         118            42             .356                 .044
Add and Subtract (A&S)        208           117             .563                 .034
Search and Compute            118            64             .542                 .046
Linear Convergence            133            78             .586                 .043
Quadratic                      97            55             .577                 .050
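The first two rows of the table can be reproduced with the usual binomial proportion estimate. The estimator is not stated here, so treating each run as an independent trial with standard deviation sqrt(p(1 - p)/n) is an assumption, and the function name is illustrative.

```python
import math

def detection_estimate(sample_size, detections):
    """Binomial estimate of detection probability and its standard deviation."""
    p = detections / sample_size
    sd = math.sqrt(p * (1.0 - p) / sample_size)
    return p, sd

# Sample sizes and detection counts taken from the table above.
fib = detection_estimate(211, 98)
fas = detection_estimate(118, 42)
print("FIB", fib)
print("F&S", fas)
```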


*NASA Contract Number NAS1-15574 with Sperry Univac.

Extensive data analysis was performed to explain the observed differences in terms of the number of
executed instructions, the number of different instructions used in computation, the degree of branching,
the fault mode (stuck-at-one or zero, input or output), and number size. The results of the statistical
analyses indicate that latency time, or equivalently, detection capability, depends primarily on the
instruction subset used during computation and the frequency of its use. Moreover, little direct
dependence was observed for such factors as fault mode, number size, degree of branching, and program
length. An exponential model was proposed and applied to the data from three programs (Add and Subtract,
Fibonacci, and Fetch and Store).

The exponential model is based on the density function of $y = \min(t, T)$, where $t$ is the detection
time measured in repetitions and $T$ is the truncation time of test, and is given by:

$$f(y) = \begin{cases} P_0\,\lambda\, e^{-\lambda y}, & y < T \\ P_0\, e^{-\lambda T} + Q_0, & y = T \quad (Q_0 = 1 - P_0) \\ 0, & \text{elsewhere} \end{cases}$$

where

$P_0$ = the detection probability

$Q_0$ = the probability of nondetection for all time

$P_0\, e^{-\lambda T}$ = the probability of nondetection due to insufficient test time

Values for $P_0$ and $\lambda$ were obtained using maximum likelihood estimators, enabling the following
data to be generated:

Program    $\hat{P}_0$    $\hat{\lambda}$    $\hat{\lambda}/\hat{P}_0$

A&S          0.568          0.577              1.02
FIB           .474           .491              1.04
F&S           .371           .398              1.07


A pictorial representation of this model is shown in figure 7 superimposed on the raw data in 
histogram form. 
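One way to obtain maximum likelihood estimates for the truncated exponential latency model is sketched here. The estimation procedure actually used is not given, so the profile-likelihood derivation, the bisection solver, the function name, and the synthetic data (generated from the A&S values with a hypothetical truncation time of 8 repetitions) are all assumptions made for illustration.

```python
import math

def fit_latency_model(detect_times, n_undetected, T):
    """Estimate (P0, lambda) from detection times < T and a count of runs
    with no detection by the truncation time T.

    Setting the derivative of the log-likelihood with respect to P0 to zero
    gives P0 = n_d / (n (1 - exp(-lambda T))); lambda then solves
    1/lambda - T / (exp(lambda T) - 1) = mean detection time.
    """
    n_d = len(detect_times)
    n = n_d + n_undetected
    ybar = sum(detect_times) / n_d
    if not 0.0 < ybar < T / 2.0:
        raise ValueError("mean detection time must lie in (0, T/2)")

    def score(lam):
        return 1.0 / lam - T / math.expm1(lam * T) - ybar

    lo, hi = 1.0e-9, 1.0
    while score(hi) > 0.0:            # expand until the root is bracketed
        hi *= 2.0
    for _ in range(200):              # bisection; score decreases with lam
        mid = 0.5 * (lo + hi)
        if score(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    # Clamped at 1; if the clamp is active the estimate is only approximate.
    p0 = min(1.0, n_d / (n * -math.expm1(-lam * T)))
    return p0, lam

# Synthetic, noise-free data: evenly spaced quantiles of the model.
true_p0, true_lam, T, n = 0.568, 0.577, 8.0, 1000
a = -math.expm1(-true_lam * T)                    # P(detect by T | detectable)
n_d = round(n * true_p0 * a)
times = [-math.log(1.0 - (i + 0.5) / n_d * a) / true_lam for i in range(n_d)]
p0_hat, lam_hat = fit_latency_model(times, n - n_d, T)
print(p0_hat, lam_hat)
```

The same function could be applied to recorded detection times from a fault-injection run, with the number of runs showing no faulty output by the end of the test supplied as the undetected count.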

If after careful testing, this method of measuring and modeling fault latency proves to be acceptable, 
an important set of coverage parameters will become available for reliability modeling. As an aside, this 
scheme also provides a method for synthesizing test programs both for pre-flight and in-flight monitoring. 


6. CONCLUDING REMARKS

Testing digital systems which perform flight critical functions is not a feasible method for estimating
system reliability. Analytic modeling of system reliability in conjunction with simulative techniques for
coverage measurement appears to be the only alternative on the horizon. Accurate reliability estimates
which account for such factors as latent faults, intermittent/transient faults, and software errors require
sophisticated techniques which are currently being developed and will result in the KREL-M reliability
assessment capability embodied in the CARE III computer program. The effects and significance of these
factors on the reliability of fault tolerant digital systems are yet to be determined, and the potential of
increased complexity brought about by the inclusion of these factors in an assessment capability such as
KREL-M is a major concern. It is anticipated that after extensive trade-off analyses, KREL-M will be
simplified and take on more of the characteristics of a production tool in lieu of its initial experimental
character.

In a parallel effort, methods for acquiring indispensable coverage data required by KREL-M are now 
becoming available. 

CATEGORY

1. TYPE & MANIFESTATION:   HARDWARE ANOMALY
                             • PERMANENT
                             • TRANSIENT
                             • INTERMITTENT

2. CAUSE:                  DESIGN ERROR
                           FABRICATION ERROR
                           RANDOM FAILURE
                           EXTERNALLY INDUCED
                             • SIGNAL ERROR
                             • POWER FAILURE
                             • PHYSICAL FAILURE
                             • EMI

3. SYSTEM EFFECT:          COMPUTER SYSTEM CONTROL LOSS
                           APPLICATION COMPUTATION ERROR
                           NONE

4. DEFENSE:                HARDWARE REDUNDANCY
                             • SPATIAL - ALTERNATE HARDWARE
                             • TEMPORAL - RETRY

Figure 1. Factors affecting coverage.

Figure 2. Delineation of hardware and software anomalies.

Figure 3. FTMP state transition diagram.

Figure 4. Probability of system failure versus processor failure rate for the FTMP.

Figure 6. Probability of system failure versus operating time for the FTMP.

Figure 7. Exponential model of fault latency (detection). Horizontal axis: time to detect (repetitions).